Using Frame Semantics for Knowledge Extraction from Twitter
Authors
Abstract
Knowledge bases have the potential to advance artificial intelligence, but they often suffer from recall problems, i.e., they lack knowledge of new entities and relations. In contrast, social media such as Twitter provide an abundance of data in a timely manner: information spreads at an incredible pace and is posted long before it makes it into more commonly used resources for knowledge extraction. In this paper we address the question of whether we can exploit social media to extract new facts, which may at first seem like finding needles in haystacks. We collect tweets about 60 entities in Freebase and compare four methods for extracting binary relation candidates, based on syntactic and semantic parsing and a simple mechanism for factuality scoring. The extracted facts are manually evaluated in terms of their correctness and relevance for search. We show that moving from bottom-up syntactic or semantic dependency parsing formalisms to top-down frame-semantic processing improves the robustness of knowledge extraction, producing more intelligible fact candidates of better quality. In order to evaluate the quality of frame semantic parsing on Twitter intrinsically, we make a multiply frame-annotated dataset of tweets publicly available.
Knowledge extraction has primarily focused on mining Wikipedia and newswire data. For this reason, knowledge bases used for search, such as Freebase, suffer from low recall, covering only certain entity types and only certain facts about those entities. Freebase, for example, contains the fact that the Walt Disney Company is a production company, but not that it owns Marvel. A common problem in knowledge extraction is what is known as reporting bias (Gordon and van Durme 2013), i.e., the fact that a lot of common knowledge is never made explicit. Social media platforms like Twitter have the potential to fill some of that gap, since they offer very different facts from those found in Wikipedia.
People may tweet an obvious fact to inform their friends of what they just realized, as a means of sarcasm, or simply to kill time. Finally, Twitter is a platform that potentially allows us to harvest facts in almost real time. For example, a company may buy up another company and tweet about it long before the fact makes it into Wikipedia or Freebase.
Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
On the other hand, extracting useful facts from Twitter is a hard problem. Tweets often contain opinionated non-factual text, automated posts from third-party websites, and/or temporary facts that are irrelevant to search. True facts seem like needles in haystacks, but the haystacks are plentiful on Twitter.
There is another reason that knowledge extraction from Twitter is hard. Most approaches to knowledge extraction rely on syntactico-semantic processing, and state-of-the-art parsing models fare badly on Twitter data (Foster et al. 2011), to the extent that it is prohibitive for downstream applications such as knowledge extraction. In this paper, we show that top-down frame semantic parsing is more robust to the domain shift from newswire to Twitter than other syntactico-semantic formalisms, and that this leads to more robust knowledge extraction. In particular, while syntactic and semantic dependency parsing models induced from newswire exhibit dramatic drops in performance when applied to Twitter data, frame semantic parsing models seem to perform almost the same across domains.
Our Approach
We select 60 entities in Freebase, distributed equally across persons, locations, and organizations (see Table 1), and extract 70k tweets mentioning at least one of these entities. The data was collected during the summer of 2014.
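The selection of tweets that mention at least one target entity can be illustrated with a minimal sketch. It assumes plain case-insensitive string matching, which the source does not specify, and the entity list is a hypothetical placeholder (only the Walt Disney Company appears in the text; the actual 60 entities are in Table 1, which is not reproduced here). The function names are likewise illustrative.

```python
import re

# Hypothetical placeholder entities, one per type; NOT the paper's Table 1.
ENTITIES = {
    "person": ["Barack Obama"],
    "location": ["Copenhagen"],
    "organization": ["Walt Disney Company"],
}

def build_matcher(entities):
    """Compile one case-insensitive pattern that matches any entity name."""
    names = [name for group in entities.values() for name in group]
    # Longest names first, so a longer name wins over a shorter overlapping one.
    names.sort(key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in names)
    # Lookarounds keep a name from matching inside a longer word.
    return re.compile(r"(?<!\w)(" + alternation + r")(?!\w)", re.IGNORECASE)

def mentioned_entities(tweet, matcher):
    """Return the lowercased entity names found in a tweet, sorted."""
    return sorted({m.group(1).lower() for m in matcher.finditer(tweet)})

def select_tweets(tweets, matcher):
    """Keep only the tweets that mention at least one entity."""
    return [tweet for tweet in tweets if matcher.search(tweet)]
```

Calling `select_tweets(tweets, build_matcher(ENTITIES))` on a list of tweet strings keeps those that name an entity. Real tweets also refer to entities through hashtags, handles, and abbreviations, which this sketch does not handle.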
We part-of-speech (POS) tag these tweets and pass the augmented tweets on to four different extraction models: a syntactic dependency parser, a semantic role labeler, a frame semantic parser, and a rule-based off-the-shelf open information extraction system, REVERB (Fader, Soderland, and Etzioni 2011). For all systems except REVERB, we apply the same heuristics to filter for relevant facts and rank them in terms of factuality using sentiment analysis. We evaluate facts in terms of their well-formedness, their correctness, and their relevance. We also ask subjects to rate triples in terms of opinionatedness for the sake of error analysis. Finally, we check the extracted facts for novelty against Freebase.
Frame Semantic Parsing
Frame semantic parsing is the task of assigning frames (Fillmore 1982) to text. Frames combine word sense disambiguation and semantic role labeling. The go-to resource
Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence
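The factuality ranking and the novelty check described under Our Approach can be sketched roughly as follows. This is not the authors' implementation: the toy opinion lexicon, the scoring formula (one minus the share of opinion words), and all names here are assumptions made for illustration, and the known-facts set is a stand-in for a Freebase lookup.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str
    source: str  # text of the tweet the triple was extracted from

# Toy opinion lexicon (assumption); a real system would use a sentiment analyzer.
OPINION_WORDS = {"love", "hate", "awesome", "terrible", "best", "worst", "lol"}

def factuality(text: str) -> float:
    """Score in [0, 1]: 1.0 means no opinion words; lower means more opinionated."""
    tokens = [tok.strip(".,!?#@").lower() for tok in text.split()]
    tokens = [tok for tok in tokens if tok]
    if not tokens:
        return 0.0
    opinionated = sum(tok in OPINION_WORDS for tok in tokens)
    return 1.0 - opinionated / len(tokens)

def rank_by_factuality(triples):
    """Order candidate triples from most to least factual source tweet."""
    return sorted(triples, key=lambda t: factuality(t.source), reverse=True)

def is_novel(triple, known_facts):
    """True if the lowercased (subject, relation, object) is not in known_facts."""
    key = (triple.subject.lower(), triple.relation.lower(), triple.obj.lower())
    return key not in known_facts
```

With a known-facts set containing `("walt disney company", "is a", "production company")`, a candidate such as `Triple("Walt Disney Company", "owns", "Marvel", ...)` would count as novel, mirroring the Freebase example in the text. Exact string matching is a simplification; aligning extracted relations with knowledge-base relations is a harder problem that this sketch leaves out.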
Similar resources
Automatic Hashtag Recommendation in Social Networking and Microblogging Platforms Using a Knowledge-Intensive Content-based Approach
In social networking/microblogging environments, #tags are often used for categorizing messages and marking their key points. Also, since some social networks such as Twitter restrict the number of characters in messages, #tags can serve as a useful tool for helping users express their messages. In this paper, a new knowledge-intensive content-based #tag recommendation system is intr...
Application of Frame Semantics to Teaching Seeing and Hearing Vocabulary to Iranian EFL Learners
A term in one language rarely has an absolute synonymous meaning in the same language; besides, it rarely has an equivalent meaning in an L2. English synonyms of seeing and hearing are particularly grammatically and semantically different. Frame semantics is a good tool for discovering differences between synonymous words in L2 and differences between supposed L1 and L2 equivalents. Vocabulary ...
Ontology-based Approach for Semantic Data Extraction from Social Big Data: State-of-the-art and Research Directions
The challenge of managing and extracting useful knowledge from social media data sources has attracted much attention from academia and industry. To address this challenge, this paper focuses on semantic analysis of textual data. We propose an ontology-based approach to extract semantics of textual data and define the domain of data. In other words, we semantically analyse the social data at t...
Generalisations over Corpus-induced Frame Assignment Rules
In this paper we discuss motivations and strategies for generalising over instance-based frame assignment rules that we extract from frame-annotated corpora. Corpus-induced syntax-semantics mapping rules for frame assignment can be used for automatic semantic role labelling of unparsed text, but further, to extract linguistic knowledge for a lexical semantic resource with a general syntax-seman...
Medical Event Extraction using Frame Semantics - Challenges and Opportunities
The aim of this paper is to present some findings from a study into how a large-scale semantic resource, FrameNet, can be applied for event extraction in the (Swedish) biomedical domain. Combining lexical resources with domain-specific knowledge provides a powerful modeling mechanism that can be utilized for event extraction and other advanced text mining-related activities. The results, from dev...